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Abstract 

The adaptive BDDC method is extended to the selection of face constraints in three 
dimensions. A new implementation of the BDDC method is presented based on a global 
formulation without an explicit coarse problem, with massive parallelism provided by a 
multifrontal solver. Constraints are implemented by a projection and sparsity of the 
projected operator is preserved by a generalized change of variables. The effectiveness of 
the method is illustrated on several engineering problems. 

Keywords: parallel algorithms, domain decomposition, iterative substructuring, BDDC, 
adaptive constraints 



1. Introduction 

The Balancing Domain Decomposition by Constraints (BDDC) was developed by 
Dohrmann [6] as a primal alternative to the Finite Element Tearing and Interconnecting 
- Dual, Primal (FETI-DP) by Farhat et al. [7j. Both methods use constraints to impose 
equality of new "coarse" variables on substructure interfaces, such as values at substructure 
corners or weighted averages over edges and faces. Primal variants of the FETI-DP were 
also independently proposed by Cros |1] and by Fragakis and Papadrakakis [lOj. It has 
been shown in [271 EZ] that these methods are in fact the same as BDDC. Polylogarithmic 
condition number bounds for FETI-DP were first proved in [29] and generalized to the case of 
coefficient jumps between substructures in [H]. The same bounds were obtained for BDDC 
in [231 121] • A proof that the eigenvalues of the preconditioned operators of both methods are 
actually the same except for the eigenvalues equal to one was given in [21] and then simplified 
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in [21[2nil2Z!- FETI-DP, and, equivalently, BDDC are quite robust. It can be proved that the 
condition number remains bounded even for large classes of subdomains with rough interfaces 
in 2D [124l39j as well as in many cases of strong discontinuities of coefficients, including some 
configurations when the discontinuities cross substructure boundaries |311 [32]. However, 
the condition number deteriorates in many situations of practical importance and a better 
selection of constraints is desirable. Enriching the coarse space so that the iterations run in 
a subspace devoid of "difficult" modes has been a successful trick in iterative substructuring 
methods used, e.g., in the development of BDD and FETI for plates from the base BDD 
and FETI methods UHl IHl |30]. Methods that build a coarse space adaptively from local 
eigenvalue calculations were also devised in other (though related) contexts [3l [9l [211 1221 [33] . 
Adaptive enrichment for BDDC and FETI-DP was proposed in [211 |2S!, with the added 
coarse functions built from eigenproblems based on adjacent pairs of substructures in 2D. 
The adaptive method, however, was formulated in terms of FETI-DP operators, and it was 
quite comphcated. 

Here, we develop the adaptive algorithm directly in terms of BDDC operators, resulting 
in a much simpler formulation and implementation. Of course, the algorithm still allows 
a translation into the language of the FETI-DP. We then extend the construction from 
[251 [26] to 3D. We find that the heuristic eigenvalue-based estimates still work reasonably 
well and that our adaptive approach can result in the concentration of computational work 
in a small troublesome part of the problem, which leads to a good convergence behaviour at 
a small added cost. 

We also develop a new implementation framework that operates on global matrices, builds 
no explicit coarse problem, and gets much of its parallelism through the direct solver used 
for solution of an auxiliary decoupled system. To preserve sparsity, we use a variant of the 
change of variables from [20], extended to an arbitrary number of constraints. Our current 
parallel implementation is built on top of the multifrontal massively parallel sparse direct 
solver MUMPS [T], motivated also by an earlier implementation of the BDDC preconditioner 
based on the frontal solver [33]. 

The rest of the paper is organised as follows. In Section [2| we establish the notation and 
review the BDDC algorithm in a form suitable for our purposes. In Section |3| we describe 
the adaptive method. Section |4] then describes the implementation on top of a massively 
parallel direct solver. Section [5] presents the generalized change of variables to preserve 
sparsity. Section [6] describes some further details of the implementation. Numerical results 
are presented in Section [7j Section |8] contains the summary and concluding remarks. 

Some of results in this paper were presented in the thesis |36j . 

2. Notation, substructuring, and BDDC 

To establish notation, we first briefly review standard substructuring concepts and state 
the BDDC method in a form suitable for our purposes. The setting and notation here is 
compatible with [28], with some additions. See, e.g., [35l|3H] for more details about iterative 
substructuring and [IQ |23l [281 [351 [38] for BDDC. 
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Consider an elliptic boundary value problem defined on a bounded domain f2 C 
and discretized by conforming finite elements. The domain Q is decomposed into 
nonoverlapping suhdomains fij, i = 1,...N, also called substructures, so that each 
substructure fli is a union of finite elements. Each node is associated with one degree 
of freedom in the scalar case, and with 3 displacement degrees of freedom in the case of 
linear elasticity. The nodes contained in more than one substructure are called the interface, 
denoted by F, and Fj = F fl f2j is the interface of substructure fij. The interface F may also 
be classified as the union of three different types of nonoverlapping sets: faces, edges, and 
corners. We will adopt here the following simple definition. A face contains all nodes shared 
solely by one pair of subdomains, an edge contains all nodes shared by same set of more than 
two subdomains, and a corner is a degenerate edge with only one node; for a more general 
definition see e.g. [13]. Edges and faces are also called globs. 

We identify finite element functions with the vectors of their coefficients in the standard 
finite element basis. These coefficients are also called variables or degrees of freedom. We 
also identify linear operators with their matrices, in bases that will be clear from the context. 

The space of all (vectors of the degrees of freedom of) finite element functions on 
subdomain Qi is denoted by Wi, and let 



W = WiX---xW, 



N- 



The space W is equipped with the standard M" basis and the Euclidean inner product 
{w,v) = w^v. For a symmetric positive semidefinite matrix M, {u,v)j^j = {Mu,v), and 
IImIIj^^ = {Mu, u)^^"^ . 

Let Ai : Wi ^ Wi be the local substructure stiffness matrix, obtained by the subassembly 
of element matrices only in substructure Qi. The matrices Ai are symmetric positive 
semidefinite for an elliptic problem. We can write vectors and matrices in the block form 



w 



Wi 



Wn 



weW, A 



Ai 



A 



N 



■.w ^W. 



(2) 



Now let f/ C W be the space of all functions from W that are continuous across 
substructure interfaces. We are interested in solving the problem 



ueU: {Au,v) = {f,v), WveU, 



(3) 



where / e FT is a given right-hand side. Vectors from U are called vectors of global degrees 
of freedom, while vectors from Wi are called local. The space U is equipped with the basis of 
0-1 vectors with one basis vector for each global degree of freedom. The basis vectors have 
Is in the places where the global degree of freedom coincides with a local one. The matrix 



R:U ^W 



(4) 



formed from these basis vectors as columns, is the familiar global-to-local mapping that 
restricts the global vectors of degrees of freedom to local degrees of freedom on each Qi. 
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Thus, R AR is the global stiffness matrix, and (|3j) is equivalent to the assembled system 

R^ARv = R^f. (5) 

The matrix R is also the matrix of the canonical embedding U G W in the given bases. 

Denote hy Uj (Z W the space of all (vectors of) finite element functions with nonzero 
values only in the interiors of substructures fli. Then Uj G U, and the space W is decomposed 
as the y4-orthogonal direct sum 

W = Ui(B Wh, Uj ±a Wh, (6) 

where the functions from Wh are called discrete harmonic. Such functions are fully 
determined by values of degrees of freedom at the interface, and they have minimal energy on 
every subdomain. Therefore, in a computer implementation, only interface values of discrete 
harmonic functions need to be stored. 

The y4-orthogonal projection onto Uj is denoted by 

P:W -^Ui. (7) 

For w G W, {I — P)w is the discrete harmonic extension from the values of w on the 
substructure boundaries. The evaluation of Pw consists of the solution of N independent 
Dirichlet problems, one in each substructure. 

The space of all discrete harmonic functions from W that are continuous at interface is 
denoted by W. We have 

W = WHnU = {I - P)U, (8) 
and the A-orthogonal decomposition 

U = Ui®W, Ui W. (9) 

The solution t> e [/ of problem ^ is split as 

Rv = u + w, ugUi.wgW. (10) 

Solving for the interior component u G Uj decomposes into independent Dirichlet 
problems. We are interested in finding the discrete harmonic component w G W , which 
is the solution of the reduced problem 

wGW: {Aw,z) = {f,z) , VzeW. (11) 

We further need an averaging operator 

E:W ^U. (12) 

The operator E replaces the variables on the interface by their averages (arithmetic or 
weighted) from all adjacent subdomains, and it preserves variables in the interiors of 
substructures. The operator E is a projection from W onto U. Then the operator 

{I-P)E:W^W (13) 
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is a projection from W onto W. Its evaluation consists of averaging between the 
substructures, followed by the discrete harmonic extension from the substructure boundaries. 
Also, note that 

{I - {I - P)E)w = {I - P) (J - E) w, e Wh, (14) 

since Pw = if w G Wh- 

Proper weights (e.g. proportional to the substructure stiffness) in the averaging given 
by E are important for the performance of BDDC (as well as other iterative substructuring 
methods) independent of different stiffness of substructures [23] . 

The BDDC preconditioner is characterized by a selection of coarse degrees of freedom, 
such as values at corners and averages over edges or faces. The action of the BDDC 
preconditioner is then defined in the space given by the requirement that the coarse degrees 
of freedom on adjacent substructures coincide, which is enforced in the algorithms by 
constraints. So, the design of the BDDC preconditioner is characterized by a selection 
of an intermediate space W satisfying these constraints, 

W CW cWh. (15) 
The BDDC then consists of preconditioned conjugate gradients (PCG) applied to the 



problem (11) with the preconditioner 



Mbddc ■.r^u = {I-P)Ew, w e W : {Aw, z) = (r, (/ - P) Ez) , G M/, (16) 

where r is the residual in the PCG method. The following condition number bound for 
BDDC will play an essential role in our design of the adaptive method. 

Theorem 1 ( j24] ). The eigenvalues of the preconditioned operator of the BDDC method 
satisfy 1 < A < ubddc, where 

00 BDDC = sup 2 • (17) 

The BDDC enforces the equality of corner coarse degrees of freedom directly by using the 
space W^, consisting of all functions where the local degrees of freedom on the substructure 
corners coincide. Then 

U CW'CW. (18) 

Just like U, space is equipped with a basis consisting of 0-lvectors. The basis vector 
corresponding to a corner degree of freedom has Is in the places where the global degree of 
freedom coincides with the corresponding substructure degree of freedom. The global-to-local 
matrix 

R^-.W^^ W, (19) 

formed from these basis vectors as columns, is the matrix of the canonical embedding 
C W, and 

A"" = R'^AR". (20) 



5 



is the stiffness matrix assembled at the subdomain corners only (Figure [T]). 

We require that there are sufficiently many corner constraints, which leads to the following 
assumption. 

Assumption 1. The matrix A is positive definite on W^. 

Denote by the space all of discrete harmonic functions in W^, 

w'' = w''n Wh. (21) 

Then 

W ^Wh, (22) 

and we construct the space W by enforcing the remaining constraints weakly by a matrix 

W = : Dw = (23) 

Each row of D defines one constraint. We require that the constraints are satisfied by all 
functions that are continuous across the interfaces, 

Dw = 0, Vw e U. (24) 



Note that (24) implies that W C W, and that the constraints Dw = involve boundary 
variables only. The adaptive algorithm will construct such matrix D. 



Lemma 1. The BDDC preconditioner (16) satisfies 

Mbddc ■.r^u = {I-P) ER'wc. (25) 

For some \, it holds 

A^w, + D'^^X = R^'^E^ {I - Pfr, , . 

where 

= DR" (27) 
differs from D only by omitting some zero columns corresponding to corners. 

Proof. The saddle point problem ( [26| is equivalent to the constrained minimization 

^ {Aw, w) - (r, (/ - P) Ew) min subject to w G Dw = 0, (28) 



with w = R^Wc- Let w be a solution of (28) and z G f//. Since w is optimal with respect 
to variation z, and (/ — P) z = 0, we have {Aw, z) = 0. Thus, w is discrete harmonic. It 
follows that (28) is equivalent to 

^ {Aw, w) - {r, (/ - P) Ew) min subject to w e W, Dw = 0, (29) 



which, by (23), is the same as (16). 

Finally, matrix DR"^ differs from D only by omitting zero columns because the constraints 
Dw = do not involve corners. ■ 
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Remark 1. In practice, the computation of {I — P)^ r can he omitted, because r = Ae, 
where the error e is discrete harmonic, and then 

{P'^r, z) = {P'^Ae, z) = {Ae, Pz) = (e, z) ^ = 0, \/z e Uj, (30) 

thus P^r = 0, that is, r = in the interiors. The condition that the error e is discrete 
harmonic is preserved in the iteration by induction, and the initial error can be made discrete 
harmonic by a suitable choice of initial approximation for the reduced problem (e.g., zero). 

3. Adaptive selection of constraints 

We first briefly review tlie principle of the adaptive method from f26], in a form suitable for 
our purposes. The condition number bound ubddc from Theorem [T] equals to the maximum 
eigenvalue Ai of the associated generalized eigenvalue problem 

weW: {{I -{I -P)E)w,{I ~{I -P)E)z)^ = X{w,z)^, VzeW. (31) 

The following statement is well known from linear algebra, e.g. [5l Theorem 5.2]. 

Lemma 2 (Courant-Fisher-Weyl minimax principle). Let c(-, ) be symmetric posi- 
tive semidefinite bilinear form on vector space V of dimension n and b (■, ■) symmetric posi- 
tive definite bilinear form on V. Then the generalized eigenvalue problem 

w eV : c{w,u) = Xb{w,u) , \fu e V (32) 

has n linearly independent eigenvectors Wk and the corresponding eigenvalues are real 

and nonnegative and the eigenvectors are stationary points of the Rayleigh quotient 

c{w,w) /h{w,w), with the stationary values equal to Aj. Order Ai > A2 > . . . > A„ > 0. 
Then, for any subspace Vk G V of dimension n — k, 

c{w,w) 
max — > Afe+i, 

weVk,Wy^O [W, W) 

with equality if 

Vk = {w eV : c{we, w) = 0, V£ = 1, . . . , A;} . (33) 



Since the bilinear form on the left-hand side of (31) is symmetric positive semideflnite 
and the bilinear form on the right-hand side is symmetric positive deflnite, Lemma [2] applies 
and leads to the following corollary. 



Corollary 1. The generalized eigenvalue problem (31) has eigenvalues Ai > A2 > . . . > 
An > 0. Denote the corresponding eigenvectors wg. Then, for any k = 1, . . . ,n — 1, and any 
linear functional Lg on W , i = 1, . . . , k, 

max! Le{w) = 0, = 1, . . . , I > A^+i, (34) 

I ll'^^IU J 

with equality if 

L, (w) = ((/ - (/ - P) E) We, (I -{I- P) E) w)^ . (35) 
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The next lemma shows that the added constraints (w) = satisfy the compatibihty 
condition (24). 



Lemma 3. The constraints (w) = 0, with given by (35), are satisfied for any w G U. 



Proof. From (|I4|), {I - {I - P) E) = {I - P){I - E). For any w e U , {I - E) w = 0, 
because E is a projection on U. ■ 

It follows that the optimal decrease of the condition number bound (17) can be achieved 
by adding the rows dj defined by djw = Li (w) to the constraint matrix D in the definition 
ofW 

However, solving the global eigenvalue problem (31) is expensive, and the vectors di are 
not of the form suitable for substructuring, i.e., each di with nonzeros at one glob only. For 
these reasons, we replace (31) by a collection of local problems, each defined by considering 
only two adjacent subdomains fli and flj at a time. Here, subdomains are considered adjacent 



if they share an edge in 2D, or a face in 3D (Figure |2|. All quantities associated with such 
pair will be denoted by the subscript ^ . Using also (14), the generalized eigenvalue problem 
(31) becomes 



Wij e Wij : 



A {wij,Zij) 



(36) 



Assumption 2. The corner constraints are already sufficient to prevent relative rigid body 
motions of any pair of adjacent substructures, so 



Wwij G Wij : AijWij 







(/ - Eij)w,j 



0, 



(37) 



i.e., the corner degrees of freedom are sufficient to constrain the rigid body modes of the 
two substructures into a single set of rigid body modes, which are continuous across the 
interface Tij. 



The maximal eigenvalue Uij of (36) is finite due to Assumption |2| and we define the 
heuristic condition number indicator 



max{uij : Qi and Qj are adjacent} . 



(3^ 



Considering two adjacent subdomains fli and Qj only, we get the added constraints 
Lf (w) = from (1351) as 



((/ - P,,) (/ - Eij) w,,,e, (/ - P^J) (/ - E,,) w)^ 



0, V£ 



h- ■ 

1 "'JJ 5 



(39) 



where Wiji are the eigenvectors corresponding to the kij largest eigenvalues from (36) 



Algorithm 1 (Adaptive BDDC). Find the smallest kij for each pair of adjacent 
substructures Vti and Vtj to guarantee that Xij^kij+i ^ where t is a given tolerance, and add 
the constraints (39) to the definition ofW. 
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The adaptive BDDC method assures that the condition number indicator uj < t with 
the minimum number of added constraints. It was presented in [26] starting from corner 
constraints only, formulated in terms of FETI-DP, and the result translated to BDDC. We 
extend the method to the case a general space W and give a much simpler implementation 
in BDDC directly. 

To formulate a numerical al gori thm, we need to write the generalized eigenvalue problem 



(36) and the added constraints (39) in terms of matrices and vectors. Consider the space Wij 
given by the corner constraints and an initial constraint matrix D^j devised from an initial 
global matrix V^. Recall that W[j is the space of functions from Wij that are continuous at 
corners, i?^^ : W^j — )■ Wij is the identity embedding, A'lj = R^jAijR^j is the matrix assembled 



at the corners, and V^j = DijRI-. Let 



li,, = I ~ D'r; {D'lP'rf)-' Dl^ (40) 
be the orthogonal projection onto nuUD? , Ilj, : W^f — )■ Wij. The initial constraint matrix 



D^- can be empty; then Iljj = /. The generalized eigenvalue problem (31) now becomes 



n,^ (/ - Ei^y" St, (/ - E,,) Ii,,w,, = K.U.jSm^.Wi,, (41) 



where 
Since 



St, = (I - P.,f A^AI - P.,) . (42) 



null n^.-^^.n,, C null n,,- (/ - Eijf Stj (/ - Eij) Hij, (43) 
the eigenvalue problem (41) reduces in the factorspace modulo nulHIjjS'f Ily to a problem 



with the operator on the right-hand side positive definite. In some computations, we have 
used the subspace iteration method LOBPCG [15] to find the dominant eigenvalues and 
their eigenvectors. The LOBPCG iterations then run in the factorspace. To use standard 



eigenvalue solvers, (41) may be converted to a matrix eigenvalue problem by penalizing the 



components in nullDf^ and rigid body modes, as already described in [26] 



It follows from the matrix form of the eigenvalue problem (43), that the constraints to 
be added are 

Uj,, {wij) = wl^,Ilij (/ - Eij)" Stj (/ - Eij) UijWij = 0, (44) 

where Wij is the restriction of w to the pair of subdomains. That is, we wish to add to the 
constraint matrix D the rows 

d,,,e = wl,U,, (/ - E,,f St, {I- E,,) U,,. (45) 

These rows are extended to the size of global matrix D simply by zeros. 

Proposition 1. The vectors dij^i, constructed for a domain consisting of only two 
substructures Vli and Qj, have matching entries on the interface between the two 
substructures, with opposite signs. 
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Proof. Consider the vector w G W that has two entries equal to 1, corresponding to a degree 
of freedom on the interface, and all other entries equal to 0. Using the definition of dij^e (45) 
and Lemma [sj we get dij^iWij = Lij^i (wij) = 0. The proof is concluded by taking Wij of this 
form for arbitrary degree of freedom, all of which satisfy relation (44). ■ 



In 2D, one can simply add rows (45) to the constraint matrix D, which is equivalent 
to the method from In 3D, unfortunately, such rows would generally have nonzero 

entries over all of the interface of f2j and flj , including the edges (where fli and Qj intersect 
other substructures). Consequently, these rows would couple several globs together, and the 
matrix D'^D'^^ would be in general no longer block diagonal with one block per glob. To 
preserve the block diagonal structure, we have to split each dijj into one row that contains 
the nonzero entries of the face, and one row for each edge that contains the nonzero entries of 
that edge. From Proposition [T| it follows that these split constraints satisfy the compatibility 



condition (24), and thus the space W is well defined. 

Remark 2. In the computations reported in Section we drop the adaptively generated 
edge constraints in 3D. Then it is no longer guaranteed that the condition number indicator 
u < T. However, the method is still observed to perform well. 

4. Parallel framework with global matrices and on top of multifrontal solver 

The main purpose of BDDC, just like any other iterative substructuring method, is to 
split the problem into subproblems, which are solved independently on separate nodes in 
a multiprocessor system. Therefore, the usual implementation results in independent local 
problems on the spaces Wi and a small coarse problem [HI 123]. Parallel implementation 
then requires a fair amount of custom coding. To reduce the amount of new code, 
a BDDC implementation that uses specially crafted calls to a frontal solver to compute 
almost all quantities on the substructures was developed [31]. However, the frontal solver 
implementation needs to construct a coarse problem, and the programmer needs to handle the 
parallelism explicitly. Fortunately, highly efficient massively parallel direct solvers exist, and 
an implementation based on such solver may avoid dealing with parallel issues completely. 

When there are only corner constraints, i.e. D'^ is empty, the BDDC preconditioner 



(|25j)-(l27j) reduces to 

Mbddc : r ^ (/ - P) ER' {A')'^ R^^E^ (/ - P)^ r. (46) 

All coupling between substructures in the matrix A'^ is concentrated at corner degrees of 
freedom, while most computational work rests inside the subdomains, and an efficient solver 
should be able to perform it independently in parallel. Our implementation is based on 
the multi-frontal solver MUMPS [1] , and numerical results show that this solver can indeed 
handle matrices of this type reasonably well. Our MATLAB implementation also uses global 
matrices. In both implementations, expressions involving sparse matrices are evaluated using 
vectors in the space just as in the formulas here. 

However, if there are any constraints in the globs, one has to solve the constrained system 



(26), and MUMPS cannot do this directly. Thus, we will transform (26) to a symmetric 



positive definite system which can be solved by MUMPS. 
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One way to solve system (26) is to introduce the orthogonal projection 11 onto the 



nuUspace of V^, which is given by 

n = / - (47) 

As opposed to the localised analogue Uij used in the previous section, IT : W is a global 

operator. Due to the block structure of D'^, where each block corresponds to a different 
glob, and because each degree of freedom belongs to at most one glob by definition, the 
construction of 11 can be performed in parallel. 

Using projection 11, the saddle point problem ( [26] ) is equivalent to 

UA^Uwc = UR'^^E'^ (/ - Pf r, Wc G null D". (48) 

However, the operator IIA'^n is singular for nontrivial D'^, so we solve instead a modified 
system 

[UA^U + t{I - n)] Wc = ni?"^E^ (/ - Pf r, (49) 
where t > is a stabilization parameter, e.g. chosen as the maximal diagonal entry in A'^. 



Now, the operator IIA^n + t{I — 11) is regular, while the solutions of the systems (26) and 



(49) are the same. 

The projection 11 enforces constraints that couple all degrees of freedom on corresponding 
globs. For this reason, the action of 11 introduces new off-diagonal elements (called fill-in) 
in the projected matrix IIA^n + t{I — 11). This is illustrated in Figure |3| where new dense 
off-diagonal blocks between globs appear. Because of these blocks, the performance of sparse 
direct solvers would seriously deteriorate. A sufficient remedy for this issue is proposed in 
the next section. 



5. Generalized change of variables 



To reduce the fill-in corresponding to the enforcing of the constraints following (47)-(48) 



we revisit and generalize the change of variables proposed in [HI |20] . On each substructure 
i, consider first the change of variables by the transformation 



HiWi, Hi 



U V 
/ 



(50) 



That is, the averages (given as rows of a matrix H^^^ ~ [ ^ ^ ] ) beginning of 

the vector replacing the variables in Wi. The remaining variables in Wi are unchanged. 

We assume that the vectors of weights in the averages are linearly independent, that is, H^^^ 
has a full row rank. While this assumption guarantees that there exists a square submatrix 
of iff^^ consisting of linearly independent columns, this does not necessarily need to be the 
matrix U, and so the inverse transformation may not exist. To correct this, we compute 
the QR decomposition of H!^"^^ with column pivoting to choose which variables in Wi will be 
replaced by the averages. Decompose 



H^^^ = Q[U V]K, 



(51) 
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where Q is an orthogonal matrix, U is an upper triangular matrix, and is a permutation 
matrix. We now define the generalized change of variables by 



HiWi, Hi 



U V 
I 



K. 



Now H;^ ^ exists, and the inverse change of variables is defined as 







I 



(52) 



(53) 



The matrix U, though invertible, is not guaranteed to be well conditioned. This is a well- 
known problem in QR decomposition [T7j. However, we can drop the rows of [U V] where the 
diagonal entry of U is small, and one can argue that the constraints that were transformed 
into rows with negligible leading entry are (numerically) redundant. Our implementation of 
the change of variables uses QR decomposition by the LAPACK routine DGEQP3. 

To compare with the change of variables from [111 120] , consider the case where there is 
just one average with unit weights. Then 



i/r=[l,...,l], U=[l], K = I, 



1 1 
1 



(54) 



and we have the change of variables 



Wi 



T,wT 



Ti = H, 



-1 



while the transformation of variables by [HI 

1 -1 



1 



1 -1 
1 



is defined as 



(55) 



(56) 



1 1 

With the change of basis, the BDDC preconditioner can be written as 
Mbddc ■.r^u={I-P) ETRw, w eW : {ATw, Tz) = (r, (J - P) ETz) 



where T = diag [Tj]. Thus, A is replaced by the transformed matrix T"^ AT, and, by assembly 
at corners following (20), A'^ becomes R'^^T^ATR'^. Then, Lemma [l] yields the matrix form 
of the algorithm: solving the system 



RcTrpT^ATR^Wc 



+ D X 



R-'^T^E^r, 
0, 



(57) 
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followed by computation of the approximate solution u E W hj u = {I — P) ETR^Wc- Here, 
the matrix D = D'^T is much sparser than D'^ thanks to the change of variables. It couples 
only the new explicit degrees of freedom on each subdomain and thus has only one +1 and 
one —1 entry on each row. In fact, the construction of D is similar to the construction of 
the operator B used in FETl methods. In computations, D can be constructed directly 
without using either D'^ or T, knowing only which pairs of the (explicit) interface degrees of 
freedom should be coupled after the change of basis. 



Instead of solving the saddle point problem (57) directly, we now use the projection as 
in ( [49I with D'^ replaced by D^, resulting in a new projection 11. The sparsity structure of 
IT and of the projected matrix UR'^'^T^ ATRm + t{I - IT) are illustrated Figure 3 As can 
be observed, change of basis preceding the projection can lead to much lower fill-in in the 
off-diagonal blocks of the projected matrix. 

The BDDC preconditioner can be finally rewritten in the algebraic form, which is actually 
used in our implementations, as follows. 

Algorithm 2. The action of the BDDC preconditioner Mbddc '■ r ^ w with the generalized 
change of variables consists of solving the system 

Aw^ = TlR'^T^E'^r, (58) 

where 

A = UR^^T^ATRm + t{I-T[), (59) 
with an arbitrary t > 0, followed by w = {I — P) ETR'^Wc- 

Remark 3. Since the transformation of variables changes averages into separate degrees 
of freedom, one can treat these degrees of freedom as corners and assemble them just as 
in ni\ [TSj WW to make all constraints primal. This gives no additional fill-in beyond the 
one caused by the change of variables, i.e., replacing A'^ by R'^^T^ATR'^. In the adaptive 
method (Section^, the corners are already set and used to compute the constraints to be 
added adaptively. Treating all constraints as corners then requires redefining which variables 
are corners. This is not supported in the code described here. See Section^ for more details. 

6. Implementation 

We have implemented the proposed method in Matlab. Later, we have developed also 
a parallel version using Fortran 90 programming language and MPl. 



First, we have implemented the BDDC preconditioner based on the formulation (26). 
In the case of corner constraints only, i.e. D'^ is empty, the method is reduced to solving 
a problem with matrix A'^ in each iteration. In the current version, we rely on the parallel 
direct solver MUMPS ^ (version 4.8.4) for this purpose. 

In the implementation, two separate instances of MUMPS are necessary - one for solving 
problems with matrix A'^ and another for a realization of the operator / — P of the discrete 



harmonic extension in (13) globally. The latter is equivalent to solving an independent 



discrete Dirichlet problem on each subdomain. 
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In the case of a nontrivial matrix D'^^ i.e. for additional constraints on edges and/or 
faces, explicit change of variables with projection (Section |5| is performed in parallel to form 
the distributed sparse matrix (59), which is then supplied to MUMPS. We have observed 
a great advantage in projecting the matrix after the change of variables compared to the 
direct projection on null D'^. It significantly decreases the computational time and memory 
consumption due to the reduced fill-in, as described in Section |5j In our experience, the 
amount of extra work needed for the transformation and the projection is typically only 
a small fraction of the time saved by the lower number of PCG iterations, compared to the 
case of corner constraints only. 

Using the projection instead of re-assembling the matrix after the change of variables 
allows us to store the sparse matrix in memory only once, and use it in the preconditioner as 
well as in the PCG method, which is formulated to run on vectors from the space W^. For 
the preconditioner, new entries arising from the transformation and projection are stored in 
the memory behind the original matrix and the convention of repeated indices allowed by 
MUMPS is exploited. 

Later, the adaptive selection of constraints described in Section [3] has been added to the 
implementation. As the parallelization of solving the generalized eigenvalue problems (41) 
on pairs of adjacent subdomains does not follow the scheme of the natural parallelization 
by subdomains, this part of the code has been written as a self-standing module that just 
passes the constraints to the main BDDC solver. Multiplication by S^j in the eigenvalue 
problem (41 ) is implemented by performing the interior correction on each of the two adjacent 
subdomains separately, and only the resulting vectors are assembled; thus, the matrices Sf^ 
and for the two adjacent substructures are not formed explicitly. 



7. Numerical results 

We have tested the adaptive algorithm on several three-dimensional problems of linear 
elasticity coming from engineering practice. As a consistency check, we have also tested the 
method in two dimensions with essentially the same results as in [SB] . The computations were 
done in Matlab and by the parallel implementation described in Section [6j In Matlab, the 
generalized eigenvalue problems for pairs of adjacent substructures were solved by explicit 
construction of the matrices and by standard methods for symmetric eigenvalue problems. 
We have also tested both the Matlab and the C versions of the LOBPCG algorithm [15j. 
The averaging operator was constructed with weights proportional to the diagonal entries of 
the substructure matrices before elimination of interiors. 

The first problem is a nozzle box of a SKODA steam turbine 28 MW for the electric 
power plant Novaky, Slovakia, loaded by steam pressure. The body of the nozzle box was 
discretized using 2696 isoparametric quadratic finite elements with 40, 254 degrees of freedom 
and decomposed into 16 substructures with 37 corners, 19 edges, and 32 faces (see Fig. |4]). 
Convergence of the algorithm with non-adaptive constraints is displayed in Table [TJ Note 
that the corner coarse degrees of freedom were not sufficient to guarantee convergence. Where 
"3eigv" is added, constraints corresponding to three dominant eigenvalues are added at each 
face. This choice leads to the same number of constraints as using simple arithmetic averages. 
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Comparing the last two rows in Table [T| we can see that constraints obtained from the 
adaptive algorithm work quite better than arithmetic averages. Our explanation is that such 
constraints might approximate better the direction of global eigenvectors corresponding to 
the extreme eigenvalues. Table |2] then contains results obtained using the adaptive selection 
of constraints. Each row corresponds to a different value of the threshold r, i.e. the target 
bound of the condition number indicator u. All eigenvectors corresponding to eigenvalues 
greater or equal to r were used to generate adaptive constraints. Comparing the results in 
Tables [l}]2} we can see that the adaptive method leads to a redistribution of the number of 
constraints on different faces. For example, with r = 20, the total number of constraints is 
still lower than by using arithmetic averages on all faces, while the number of iterations is 
improved by almost 25% and the condition number estimate k is improved by more than 
50%. 

The second problem is a beam with a mesh refinement around a notch. It is discretized 
using 245, 687 tetrahedral finite elements with 143, 451 degrees of freedom, and decomposed 
into 8 substructures with 31 corners, 18 edges, and 19 faces (see Fig. [s]). The results with 
non-adaptive constraints are summarized in Table [3| and results of the adaptive method 
are presented in Table |4j Comparing these two tables, we can see that, similarly as for 
the nozzle box problem, doubling the number of constraints reduces number of iterations to 
a half. Nevertheless, for both problems, the adaptive algorithm leads to a relatively small 
improvement in terms of the number of iterations and of the condition number estimate. This 
indicates that for these problems the simple arithmetic averages already work well enough, 
and there are no interfaces that would require extra work - the quality of the decomposition 
is uniform, as seen in Figs. |4]-[5| 

On the other hand, the power of the adaptive algorithm seems to be very beneficial for 
finite element discretization with bad aspect ratios of elements. An example of such problem 
is a bridge construction discretized by 39, 060 hexahedral finite elements with 157, 356 degrees 
of freedom, and distributed into 16 substructures with 250 corners, 30 edges, and 43 faces 
(see Fig.j6|. The results are summarized in Tables |5]-[6} Comparing the last two rows 
in Table [5]we can see, that relatively poor convergence with arithmetic averages improves 
quite significantly when arithmetic averages over faces are replaced by the same number 
of adaptive averages. Moreover, from Table [6] we see that, e.g., doubling the number of 
constraints with r = 5 decreases the number of iterations more than six times, and with 
r = 2 the number of iterations is reduced more than ten times while the number of constraints 
increases approximately four times. 

In order to test the performance of the algorithm in the presence of jumps in material 
coefficients, we have created a problem of a cube with material parameters = 10^ Pa and 
V = 0.45, penetrated by four bars with parameters E = 2.1 ■ 10^^ Pa and u = 0.3, consisting 
of 107, 811 degrees of freedom, and distributed into 8 substructures with 30 corners, 16 edges, 
and 15 faces (see Fig. [t] and note that the bars cut the substructures only through faces). 
Similar problems are solved in practice to determine numerically (anisotropic) properties 
of composite materials. Comparing the results in Tables [7] and |8] we can see, that with 
r = 10, 000 only 10 additional averages over faces are used to decrease the number of 
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iterations 2.6 times. With r = 2, the number of iterations is decreased 10 times compared 
to the non-adaptive algorithm with arithmetic averages over all globs (c+e+f), whereas the 
number of constraints is increased less than 3 times. 

To test the parallel behaviour of the MUMPS solver at our applications, we run the 
solver only with corner coarse degrees of freedom on a benchmark problem consisting of 
cubic subdomains, the number of which is growing in two dimensions (see Fig. |8|. Since the 
size of the subdomains is fixed, the problem fixed at one side and loaded at the opposite side 
is changing its nature, and consequently, some growth in number of iterations is expected. 
The sequence of problems is run on an increasing number of processors, which matches the 
number of subdomains. Presented computations were performed on 1.5 GHz Intel Itanium 2 
processors of SGI Altix 4700 computer in CTU Supercomputing Centre, Prague. 

We can see in Table |9| that after a jump in times between 16 and 25 subdomains related 
probably to the computer's architecture, the times of analysis, factorization, as well as per one 
iteration remain almost constant. The "total wall time" includes also the second factorization 
of MUMPS for computing the discrete harmonic extensions and all I/O operations. 

In the presented algorithm, a considerable amount of work is spent by generating the 
adaptive constraints. This part, which consists of solving local eigenproblems, may eventually 
dominate the whole computation. Each of these eigenproblems is common to a pair of 
subdomains, so the parallelism is different from the "natural" subdomain-wise parallelism 
of domain decomposition. For this reason, an independent distribution of eigenproblems 
among processors is performed, and each processor which solves an eigenproblem is linked 
with the two processors which store the data of subdomains within the pair. 

An investigation of scaling of this part of the implementation on a variable number of 
processors was performed for the problem of turbine nozzle box with 16 subdomains. In 
Fig. |9| the summary for the sequence of 2^^, k = 0,1,2, ... , 5, processors is shown. 

8. Conclusion 

The adaptive BDDC method has been presented. The paper contains several original 
contributions. First, the definition of space where BDDC runs is given as the nuUspace 
of a global matrix of constraints. For an efficient and straightforward implementation of 
this formulation, a generalization of the change of variables is proposed. This allows an 
efficient handling of multiple arbitrary constraints on a substructure face. This functionality 
is required for the implementation of constraints that are generated adaptively. The adaptive 
selection of constraints from [26] has been reformulated in a mathematically equivalent way to 
use only the operators of BDDC and to match the overall approach of the rest of this paper 
to minimize programming requirements. The adaptive method is based on simultaneous 
solution of generalized eigenvalue problems defined for each face in the decomposition. The 
eigenvalues serve as a condition number indicator, so the minimal number of constraints is 
added to guarantee that the condition number indicator is below a given threshold. The 
corresponding eigenvectors are used to derive the coefficients of the constraints. Numerical 
experiments confirm that the eigenvalues provide a good prediction of the final condition 
number of the preconditioned operator. 
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A parallel implementation of the method has been developed and presented. It is based 
on a global formulation of the matrix of the BDDC preconditioner, and it is built on top of 
solver MUMPS, which provides most of the parallelism and minimizes custom coding. The 
implementation has been tested on a number of problems of 3D elasticity. Results for several 
real world problems are included. 

In our experiments, adaptive BDDC has shown to be quite powerful. Many times, it has 
been able to save the situation for poorly selected corners, even in the case of disconnected 
subdomains, the situation often faced in real applications of domain decomposition when 
using graph partitioners. Adaptive BDDC is able to handle very ill-conditioned problems 
(e.g. problems with jumps in coefficients, complicated geometries with deformed elements, 
etc.), which are almost impossible to solve by standard BDDC method using only arithmetic 
averages on edges and faces. Such problems would either require a prohibitive number of 
PCG iterations or may not converge at all. This class of problems is the target application 
of the proposed method, as the extra cost of generating the constraints adaptively is not 
negligible and would not pay for well-conditioned problems. 

The solution of local eigenproblems by LOBPCG in generation of the adaptive constraints 
requires many iterations and accounts for most of the time for some problems. A suitable 
preconditioning of these eigenproblems to reduce the number of LOBPCG iterations will be 
studied elsewhere. 
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Figure 1: Example of an actual mesh (top) and the corresponding fictitious mesh for construction of space 
W (bottom), the dots mark corners. 




Figure 2: Illustration of two adjacent subdomains in 2D for the computation of the condition number 
indicator. 
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(a) nnz = 136, 937 



(b) nnz = 141,773 




(e) nnz = 228, 954 (f) nnz = 156, 982 



Figure 3: Sparsity patters for 3D elasticity problem for a cube decomposed into 2x2x2 substructures 
{H/h = 4) with 7 corners, 6 edges, 12 faces, and 2187 degrees of freedom. The matrix (resp. D ) 
contains 54 rows to enforce the equality of arithmetic averages over edges. The matrices (a), (c), (e) are in 
the original degrees of freedom, while (b), (d), (f) are after the change of variables ([53|: the operators A'^ 
in panel (a) and R'^'^T^ATR'^ in panel (b), projections 11 in panel (c) and 11 in panel (d), and projected 
operators nA=n + i(/-n) in panel (e) and UR'^^T'^ ATR^U + t{I - Tl) in panel (f). All are square matrices 
with size 2925. 
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Figure 4: Finite element discretization and substructuring of the nozzle box, consisting of 40, 254 degrees of 
freedom, 16 substructures, 37 corners, 19 edges, and 32 faces. 



constraints 


Nc 


K 


it 


c 





NA 


NA 


c+e 


117 


1021.7 


103 


c+e+f 


213 


40.3 


47 


c+e+f (3eigv) 


213 


26.5 


40 



Table 1: Results for the turbine nozzle box problem. The first three rows correspond to non-adaptive 
approach with corner constraints and arithmetic averages over edges/faces, and the last row corresponds to 
corner constraints with arithmetic averages over edges and three weighted averages over faces obtained from 
eigenvectors of the local generalized eigenvalue problems, Nc is number of constraints (rows in the matrix £)), 
K is the approximate condition number estimate from the Lanczos sequence in conjugate gradients, and it 
is the number of iterations for relative residual tolerance 10^^. 
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T 


u 


Nc 




it 


oo(=c+e) 


NA 


117 


1021.690 


103 


50 


49.772 


158 


44.8781 


48 


20 


19.824 


200 


16.8938 


36 


10 


9.965 


274 


11.171 


27 


5 


4.998 


408 


8.820 


20 



Table 2: Results for the turbine nozzle box problem using the adaptive approach, r is the threshold, and lo 
is the condition number indicator from (38). The other headings are same as in Table [ij 





Figure 5: Finite element discretization and substructuring of the beam with a notch, consisting of 143,451 
degrees of freedom, 8 substructures, 31 corners, 18 edges, and 19 faces. 



constraint 


Nc 




it 


c 





127.1 


79 


c+e 


111 


101.0 


61 


c+e+f 


168 


22.4 


32 


c+e+f (3eigv) 


168 


13.2 


30 



Table 3: Results for the beam with a notch. The headings are same as in Table [T] 
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T 




Nc 




it 


oo(=c+e) 


149.0 


111 


101.0 


61 


20 


18.272 


119 


19.011 


41 


10 


9.986 


134 


8.505 


31 


5 


4.994 


163 


4.655 


24 


3 


2.998 


215 


2.873 


18 


2 


1.993 


340 


2.145 


14 



Table 4: Results for the beam with a notch. The headings are same as in Table [2j 




Figure 6: Finite element discretization and substructuring of the bridge construction, consisting of 157,356 
degrees of freedom, 16 substructures, 250 corners, 30 edges, and 43 faces. 



constraint 


Nc 




it 


c 





2301.4 


224 


c+e 


180 


2252.4 


220 


c+e+f 


309 


653.6 


160 


c+e+f (3eigv) 


309 


177.8 


103 



Table 5: Results for the bridge construction. The headings are same as in Table [T] 



25 



T 




Nn 
i V c 


K 


<t 


(X)(=c+e) 


6500.5 


180 


2252.4 


220 


650 


589.338 


185 


483.517 


169 


30 


29.568 


292 


28.739 


64 


5 


4.997 


655 


5.014 


26 


2 


1.998 


1301 


2.011 


14 



Table 6: Results for the bridge construction. The headings are same as in Table [2] 




Figure 7: Finite element discretization and substructuring of the cube with jumps in coefficients, consisting 
of 107,811 degrees of freedom, 8 substructures, 30 corners, 16 edges, and 15 faces. 



constraint 


Nc 




it 


c 





408,101.0 


326 


c+e 


108 


125,390.0 


234 


c+e+f 


153 


18,914.9 


169 


c+e+f (3eigv) 


153 


1266.4 


71 



Table 7: Results for the cube with jumps in coefficients. The headings are same as in Table [T] 



r 




Nc 




it 


oo(=c+e) 


270,000.0 


108 


125,390.0 


234 


10,000 


5145.293 


118 


1843.35 


90 


1,000 


380.019 


129 


173.562 


35 


100 


77.189 


132 


6.423 


24 


5 


4.990 


173 


4.362 


20 


2 


1.998 


451 


2.803 


16 



Table 8: Results for the cube with jumps in coefficients. The headings are same as in Table [2] 
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Figure 8: Example of a configuration of planar cubes test problem with 36 subdomains and H/h = 8, red 
dots represent corners. 



number of subdomains 
degrees of freedom 


4 

7803 


9 

16,875 


16 
29,403 


25 
45,387 


36 
64,827 


49 
87,723 


64 
114,075 


condition number est. 


28.3 


38.0 


42.2 


44.4 


45.7 


46.5 


47.1 


number of PCG iterations 


13 


26 


36 


42 


44 


46 


47 


analysis by MUMPS (sec) 


0.2 


0.5 


1 


15 


14 


16 


19 


factorization by MUMPS (sec) 


0.5 


0.4 


0.8 


12 


10 


12 


14 


PCG iterations (sec) 


0.8 


3.6 


13 


613 


524 


579 


643 


one PCG iteration (sec) 


0.06 


0.14 


0.4 


15 


12 


13 


14 


total wall time (sec) 


3 


6 


19 


715 


616 


696 


794 



Table 9: Weak scaling on planar cubes problem (e.g. Fig. |8]), corners only, H/h = 8. 
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Figure 9: Dependence of the computational time on the number of processors for solution of local 
eigenproblems, nozzle box problem, 16 subdomains, 30 eigenproblems. 
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